Heart Disease Dataset

In this article, we solve a classification problem in TensorFlow with the Estimator API, using the Heart Disease Dataset from the UCI Machine Learning Repository.

Picture Source: harvard.edu

Attribute Information:

  1. Age
  2. Sex
    • 0: Female
    • 1: Male
  3. Chest Pain Type
    • 1: Typical Angina
    • 2: Atypical Angina
    • 3: Non-Anginal Pain
    • 4: Asymptomatic
  4. Serum Cholesterol (in mg/dl)
  5. FBS: Fasting Blood Sugar > 120 mg/dl
    • 0 = False
    • 1 = True
  6. Resting Electrocardiographic Results
    • 0: normal
    • 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  7. Maximum Heart Rate Achieved
  8. Exercise Induced Angina
    • 0: No
    • 1: Yes
  9. Oldpeak = ST Depression Induced By Exercise Relative To Rest
  10. Slope: The Slope Of The Peak Exercise ST Segment
    • 1: Upsloping
    • 2: Flat
    • 3: Downsloping
  11. Number Of Major Vessels (0-3) Colored By Fluoroscopy
  12. Thal
    • 3: Normal
    • 6: Fixed Defect
    • 7: Reversible Defect

Variable to be predicted

Problem Description

Develop a model that predicts whether heart disease is present or absent based on the remaining features.

X and y sets

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
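The stratification property described above can be sketched as follows, using a toy stand-in for the heart-disease data (the real data would be loaded from the UCI repository):

```python
# Sketch of stratified splitting, assuming features in X and the binary
# disease label in y. The 100-row, 70/30 toy data below stands in for
# the actual heart-disease table.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))          # 13 features, like the dataset above
y = np.array([0] * 70 + [1] * 30)       # imbalanced binary target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the ~70/30 class ratio of the full dataset.
    print(np.bincount(y[test_idx]))
```

Every fold of 20 samples contains 14 negatives and 6 positives, matching the overall class ratio.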

Modeling: TensorFlow Boosted Trees Classifier with Feature Importance Analysis

Feature Columns

Create the feature columns, using the original numeric columns as is and one-hot-encoding categorical variables.
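A minimal sketch of that setup, assuming the usual UCI column names (age, chol, thalach, oldpeak for numerics; sex, cp, fbs, restecg, exang, slope, ca, thal for categoricals) and the legacy tf.feature_column API used with Estimators:

```python
# Feature columns: numerics passed through as-is, categoricals one-hot encoded.
# Column names and vocabularies are assumed from the attribute list above.
import tensorflow as tf

NUMERIC = ['age', 'chol', 'thalach', 'oldpeak']
CATEGORICAL = {'sex': [0, 1], 'cp': [1, 2, 3, 4], 'fbs': [0, 1],
               'restecg': [0, 1, 2], 'exang': [0, 1],
               'slope': [1, 2, 3], 'ca': [0, 1, 2, 3], 'thal': [3, 6, 7]}

feature_columns = []
for name in NUMERIC:
    # Numeric columns are used unchanged.
    feature_columns.append(tf.feature_column.numeric_column(name, dtype=tf.float32))
for name, vocab in CATEGORICAL.items():
    # Indicator columns one-hot encode the categorical vocabulary.
    cat = tf.feature_column.categorical_column_with_vocabulary_list(name, vocab)
    feature_columns.append(tf.feature_column.indicator_column(cat))
```

Note that tf.feature_column is deprecated in recent TensorFlow releases; it is shown here because it matches the Estimator workflow of this article.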

Input Function

The input function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. Concretely, it is a function that returns a tf.data.Dataset object yielding two-element (features, labels) tuples, where features is a dictionary mapping column names to batches of feature values and labels is a batch of target values.
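A minimal sketch of such an input function, assuming the features live in a dict of column name to NumPy array and the labels in a NumPy array:

```python
# Factory that builds an Estimator-style input_fn from in-memory arrays.
import numpy as np
import tensorflow as tf

def make_input_fn(features, labels, shuffle=True, n_epochs=None, batch_size=32):
    def input_fn():
        # The Dataset streams (features_dict, label) pairs.
        ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(labels))
        # n_epochs=None repeats indefinitely (training); use 1 for evaluation.
        ds = ds.repeat(n_epochs).batch(batch_size)
        return ds
    return input_fn
```

For training one would call make_input_fn(X_train_dict, y_train); for evaluation, shuffle=False and n_epochs=1.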

Building the input pipeline

Training the model

ROC Curves
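An ROC curve plots the true-positive rate against the false-positive rate as the decision threshold varies. A sketch with scikit-learn, using illustrative probabilities rather than actual model output:

```python
# ROC curve and AUC from predicted probabilities; y_score values are toy numbers.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted P(disease)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75 for this toy example
```

With an Estimator, y_score would come from the 'probabilities' field of classifier.predict(...).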

Feature Importance

We can investigate which features the model relies on through its directional feature contributions (DFCs). The approach is similar to scikit-learn's feature-importance analysis and has been outlined in [6].

A nice property of DFCs is that the sum of the contributions + the bias is equal to the prediction for a given example.
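Numerically, this property looks as follows; the DFC values and bias below are made-up numbers for illustration, not model output:

```python
# Per-example directional feature contributions plus the bias recover the
# model's predicted probability. All numbers here are illustrative.
import numpy as np

bias = 0.54  # baseline prediction, e.g. the mean label of the training set
dfc = np.array([
    [0.12, -0.30, 0.05, 0.08],   # patient 0: one feature per column
    [-0.04, 0.22, -0.10, 0.01],  # patient 1
])
predictions = bias + dfc.sum(axis=1)
print(predictions)  # [0.49, 0.63]: each entry is that patient's prediction
```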

Feature Importance for a patient

Plot the DFCs for an individual patient, color-coded by the direction of each contribution, and annotate the figure with the feature values.
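A plotting sketch for one patient; the feature names and DFC values are illustrative, not model output:

```python
# Horizontal bar chart of one patient's DFCs, green for positive (pushes
# toward "disease present"), red for negative. Values are made up.
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting
import matplotlib.pyplot as plt

features = ['age', 'ca', 'cp', 'thal']
values = [-0.05, -0.12, 0.15, 0.21]
colors = ['green' if v > 0 else 'red' for v in values]

fig, ax = plt.subplots()
ax.barh(features, values, color=colors)
ax.set_xlabel('Directional feature contribution')
for i, v in enumerate(values):
    # Annotate each bar with its contribution value.
    ax.text(v, i, f'{v:+.2f}')
```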

Global feature importances

Gain-based feature importances

Gain-based feature importances are available in the TensorFlow Boosted Trees estimators via classifier.experimental_feature_importances.
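Conceptually, a gain-based importance is the total loss reduction ("gain") attributed to each feature across all tree splits, normalized to sum to one. A sketch of that aggregation, with made-up split records standing in for what the estimator tracks internally:

```python
# Aggregate per-split gains by feature and normalize. The (feature, gain)
# records below are illustrative, not taken from a trained model.
splits = [('thal', 3.2), ('cp', 2.1), ('thal', 1.4), ('ca', 0.9), ('cp', 0.4)]

gains = {}
for feature, gain in splits:
    gains[feature] = gains.get(feature, 0.0) + gain

total = sum(gains.values())
importances = {f: g / total for f, g in gains.items()}
print(importances)  # {'thal': 0.575, 'cp': 0.3125, 'ca': 0.1125}
```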

Average absolute DFCs

Permutation feature importance
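Permutation importance is model-agnostic: shuffle one feature column at a time and measure how much the held-out score drops. A sketch with a scikit-learn logistic regression standing in for the boosted-trees model:

```python
# Manual permutation importance: the drop in accuracy after shuffling a
# column measures how much the model depends on that feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Only column 0 drives the label, so it should dominate the importances.
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression().fit(X, y)
baseline = model.score(X, y)

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-label link
    importances.append(baseline - model.score(X_perm, y))
print(importances)
```

In practice one would average the drop over several shuffles and compute it on a held-out set rather than the training data.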


References

  1. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.J., Sandhu, S., Guppy, K.H., Lee, S. and Froelicher, V., 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American journal of cardiology, 64(5), pp.304-310.
  2. Aha, D. and Kibler, D., 1988. Instance-based prediction of heart-disease presence with the Cleveland database. University of California, 3(1), pp.3-2.
  3. Gennari, J.H., Langley, P. and Fisher, D., 1989. Models of incremental concept formation. Artificial intelligence, 40(1-3), pp.11-61.
  4. Regression analysis Wikipedia page
  5. Tensorflow tutorials
  6. TensorFlow Boosted Trees Classifier
  7. Lasso (statistics) Wikipedia page)
  8. Tikhonov regularizationm Wikipedia page
  9. Palczewska A., Palczewski J., Marchese Robinson R., Neagu D. (2014) Interpreting Random Forest Classification Models Using a Feature Contribution Method. In: Bouabana-Tebibel T., Rubin S. (eds) Integration of Reusable Systems. Advances in Intelligent Systems and Computing, vol 263. Springer, Cham
  10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 3-7). New York: Springer.
  11. Jordi Warmenhoven, ISLR-python
  12. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R